Introduction

For this particular project, we are looking at the stock prices of three major tech companies: Apple, Facebook and Google, from the years of 2012 to 2018. We got access to the dates, opening S&P 500 indices and closing prices of these companies from Yahoo Finance.

Based on the nature of stock market, there can potentially be temporal structures when analyzing and predicting stock prices. We took the closing price as the response variable and tried to fit appropriate models with the S&P 500 indices as the mean structure to help predict the closing prices each day for each company, as well as understanding the temporal structures within.

Due to the fact that all of the companies share the same timeline and have data of their own at the same time points, we would have to fit three individual models for each of them. We will first look at some Explanatory Data Analyses for all three companies, then try to fit simpler models with no temporal structures, and finally fit and evaluate temporal models for each company. For the temporal model fitting part specifically, we will try two different methods: both auto-fitting ARIMA models, as well as models with Gaussian Process. We then compared the performances of all three types models fit by both methods.

Lastly, we wish to come to a conclusion for our questions of interest: can we predict the closing prices of stocks with the S&P 500 index solely? Is there any temporal dependency in the closing prices? Are there differences among different companies or do they share similar trends and structures in their stocks? Can this trend or model, potentially with the temporal structure, potentially be generalized to other companies? We would also have a discussion on the adequacy, potential problems with the models and provide suggestions for developing this project in the future.

Exploratory Data Analysis

summary(df)
##       Date                 Open         fb_close       google_close   
##  Min.   :2012-05-18   Min.   :1278   Min.   : 17.73   Min.   : 277.7  
##  1st Qu.:2013-11-11   1st Qu.:1765   1st Qu.: 49.50   1st Qu.: 504.1  
##  Median :2015-05-06   Median :2033   Median : 81.73   Median : 581.6  
##  Mean   :2015-05-06   Mean   :2003   Mean   : 90.37   Mean   : 641.2  
##  3rd Qu.:2016-10-25   3rd Qu.:2178   3rd Qu.:124.94   3rd Qu.: 779.6  
##  Max.   :2018-04-20   Max.   :2867   Max.   :193.09   Max.   :1175.8  
##   apple_close    
##  Min.   : 55.79  
##  1st Qu.: 81.64  
##  Median :105.80  
##  Mean   :108.11  
##  3rd Qu.:126.81  
##  Max.   :181.72
#time trend
p1 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = fb_close)) +
  geom_line() +
  ggtitle("Stock Close Price of Facebook")
p2 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = apple_close)) +
  geom_line() +
  ggtitle("Stock Close Price of Apple")
p3 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = google_close)) +
  geom_line() +
  ggtitle("Stock Close Price of Google")
p4 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = Open)) +
  geom_line() +
  ggtitle("S&P 500 Index Open")
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

#scatter plot
p1 = ggplot(data = df, aes(x = Open, y = fb_close)) +
  geom_point(size = 0.1) +
  ggtitle("Close~Open for Facebook")
p2 = ggplot(data = df, aes(x = Open, y = apple_close)) +
  geom_point(size = 0.1) +
  ggtitle("Close~Open for Apple")
p3 = ggplot(data = df, aes(x = Open, y = google_close)) +
  geom_point(size = 0.1) +
  ggtitle("Close~Open for Google")
grid.arrange(p1, p2, p3, nrow = 2, ncol = 2)

We mainly focused on three variables Date, Open (Opening S&P indices) and Close (Closing stock price) for all three companies.

The line plots of all of the closing prices of the three companies against date, as well as the S&P 500 index against date, show linear trend as time goes by. This is an expected trend as the stock market is continuously inflating, especially since the tech companies have been rapidly developing during the time period that we’re looking at. Similarly, the S&P 500 index has an increasing the linear trend. Thus, by using the indices as a predictor/mean structure, we’re able to de-trend the linear trend in the closing prices and then move on to looking at any remaining temporal structure.

Then we look at scatter plots of all three companies with the closing price against the open S&P 500 indices. Even though there is a bit of discrepancy in the trend shown by Apple, it is more or less closer to a linear trend. Thus, the scatter plots further confirmed using the S&P 500 index as the mean structure.

Method 1: Simple Linear Models

The first method is fitting a simple linear model for each individual company. We used the S&P 500 index to predict daily closing price for all three companies separately.

#Apple
apple_lm = lm(apple_close ~ Open, data = df)
summary(apple_lm)
## 
## Call:
## lm(formula = apple_close ~ Open, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.504 -12.250  -0.823  10.708  35.669 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -49.549667   2.066132  -23.98   <2e-16 ***
## Open          0.078712   0.001015   77.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.17 on 1488 degrees of freedom
## Multiple R-squared:  0.8016, Adjusted R-squared:  0.8015 
## F-statistic:  6012 on 1 and 1488 DF,  p-value: < 2.2e-16
df$apple_naive_pred = predict(apple_lm, data=df$Open)
df$apple_naive_residual = df$apple_close - df$apple_naive_pred
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = apple_close, color = "red")) +
  geom_line(aes(y = apple_naive_pred, color = "blue")) +
  xlab("time") +
  ylab("Apple Daily Close Stock Price") +
  ggtitle("Simple Linear Model")

rmse(df$apple_close, df$apple_naive_pred)
## [1] 14.16018
#Facebook
fb_lm = lm(fb_close ~ Open, data = df)
summary(fb_lm)
## 
## Call:
## lm(formula = fb_close ~ Open, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.987 -12.587   0.699  11.056  38.127 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.687e+02  2.078e+00  -81.18   <2e-16 ***
## Open         1.293e-01  1.021e-03  126.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.25 on 1488 degrees of freedom
## Multiple R-squared:  0.9152, Adjusted R-squared:  0.9151 
## F-statistic: 1.605e+04 on 1 and 1488 DF,  p-value: < 2.2e-16
df$fb_naive_pred = predict(fb_lm, data=df$Open)
df$fb_naive_residual = df$fb_close - df$fb_naive_pred
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = fb_close, color = "red")) +
  geom_line(aes(y = fb_naive_pred, color = "blue")) +
  xlab("time") +
  ylab("Facebook Daily Close Stock Price") +
  ggtitle("Simple Linear Model")

rmse(df$fb_close, df$fb_naive_pred)
## [1] 14.23986
#Google
google_lm = lm(google_close ~ Open, data = df)
summary(google_lm)
## 
## Call:
## lm(formula = google_close ~ Open, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -182.19  -23.43   14.16   40.36  163.93 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.011e+02  9.740e+00  -51.45   <2e-16 ***
## Open         5.703e-01  4.785e-03  119.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.79 on 1488 degrees of freedom
## Multiple R-squared:  0.9052, Adjusted R-squared:  0.9051 
## F-statistic: 1.42e+04 on 1 and 1488 DF,  p-value: < 2.2e-16
df$google_naive_pred = predict(google_lm, data=df$Open)
df$google_naive_residual = df$google_close - df$google_naive_pred
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = google_close, color = "red")) +
  geom_line(aes(y = google_naive_pred, color = "blue")) +
  xlab("time") +
  ylab("Google Daily Close Stock Price") +
  ggtitle("Simple Linear Model")

rmse(df$google_close, df$google_naive_pred)
## [1] 66.74942

Method 2: ARIMA Time Series Models

\[ \begin{aligned} Close_t &= ARIMA_{(p, q, d) \times (P, Q, D)_s}(Close_{t-1, \cdots}) + \beta_1 * Open_{(S\&P_500)} \end{aligned} \]

#Apple
apple.ts = auto.arima(df %>% select(apple_close), xreg = df$Open, seasonal = TRUE)
apple.ts %>% summary()
## Series: df %>% select(apple_close) 
## Regression with ARIMA(2,1,0) errors 
## 
## Coefficients:
##          ar1      ar2    xreg
##       0.0092  -0.0491  0.0053
## s.e.  0.0303   0.0262  0.0033
## 
## sigma^2 estimated as 2.676:  log likelihood=-2844.26
## AIC=5696.52   AICc=5696.54   BIC=5717.74
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.05806906 1.633794 1.149067 0.03724077 1.097862 0.9989548
##                      ACF1
## Training set -0.001345197
ggtsdisplay(apple.ts$residuals, main = "ARIMA(2,1,0)")

#Residual plot looks good and we will go with the result.
df$apple_ts_pred = c(apple.ts$fitted)
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = apple_close, color = "red")) +
  geom_line(aes(y = apple_ts_pred, color = "blue")) +
  xlab("time") +
  ylab("Apple Daily Close Stock Price") +
  ggtitle("ARIMA(2,1,0)")

rmse(df$apple_close, df$apple_ts_pred)
## [1] 1.633794
#Facebook
facebook.ts = auto.arima(df %>% select(fb_close), xreg = df$Open, seasonal = TRUE)
facebook.ts %>% summary()
## Series: df %>% select(fb_close) 
## Regression with ARIMA(2,1,2) errors 
## 
## Coefficients:
##           ar1     ar2     ma1      ma2   drift    xreg
##       -0.0576  0.8242  0.0539  -0.8907  0.0860  0.0021
## s.e.   0.0700  0.0682  0.0574   0.0571  0.0306  0.0031
## 
## sigma^2 estimated as 2.82:  log likelihood=-2881.82
## AIC=5777.64   AICc=5777.72   BIC=5814.78
## 
## Training set error measures:
##                        ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.002329128 1.675428 1.120526 -0.1057361 1.517573 0.9969723
##                       ACF1
## Training set -0.0004667567
ggtsdisplay(facebook.ts$residuals, main = "ARIMA(2,1,2)")

#After looking at the residual plot, we found spikes at period 30 and tried to see if having a seasonal AR or MA trend makes the model better.
facebook.try1 = Arima(df %>% select(fb_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 0, 1),period = 30))
ggtsdisplay(facebook.try1$residuals)

facebook.try2 = Arima(df %>% select(fb_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(1, 0, 0),period = 30))
ggtsdisplay(facebook.try2$residuals)

#However, by evaluating residual plots, adding seasonal terms don't seem to give a better performance. Plus, we can see that autocorrelation with lag 30 is about 0.1, which is relatively small. So, we decided to stick with the result from auto.arima.
df$fb_ts_pred = c(facebook.ts$fitted)
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = fb_close, color = "red")) +
  geom_line(aes(y = fb_ts_pred, color = "blue")) +
  xlab("time") +
  ylab("Facebook Daily Close Stock Price") +
  ggtitle("ARIMA(2,1,2)")

rmse(df$fb_close, df$fb_ts_pred)
## [1] 1.675428
#Google
google.ts = auto.arima(df %>% select(google_close), xreg = df$Open, seasonal = TRUE)
google.ts %>% summary()
## Series: df %>% select(google_close) 
## Regression with ARIMA(2,1,2) errors 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2   drift    xreg
##       -0.5939  -0.8918  0.5656  0.8422  0.4368  0.0666
## s.e.   0.1005   0.0927  0.1259  0.1076  0.2345  0.0197
## 
## sigma^2 estimated as 87.05:  log likelihood=-5435.1
## AIC=10884.2   AICc=10884.28   BIC=10921.34
## 
## Training set error measures:
##                      ME     RMSE      MAE         MPE    MAPE      MASE
## Training set 0.02346136 9.307922 6.106312 -0.01130659 0.96099 0.9965447
##                    ACF1
## Training set 0.00876494
ggtsdisplay(google.ts$residuals, main = "ARIMA(2,1,2)")

#From ACF/PACF plots, we found spikes and potential periodic trend so we tried to add seasonal terms to the model.
google.try1 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 1, 0),period = 7))
ggtsdisplay(google.try1$residuals)

google.try2 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(1, 1, 0),period = 7))
ggtsdisplay(google.try2$residuals)

google.try3 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 1, 1),period = 7))
ggtsdisplay(google.try3$residuals)

#However, after trying different combinations of seasonal terms, we found that ACF/PACF haven't improved much. Plus, autocorrelation value relatively low. So we decided to go with the result from auto.arima function.
df$google_ts_pred = c(google.ts$fitted)
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = google_close, color = "red")) +
  geom_line(aes(y = google_ts_pred, color = "blue")) +
  xlab("time") +
  ylab("Google Daily Close Stock Price") +
  ggtitle("ARIMA(2,1,2)")

rmse(df$google_close, df$google_ts_pred)
## [1] 9.307922

Method 3: Gaussian Process

\[ Close_t = \beta X + w_{t} \\ w_{t} \sim GP(0,\Sigma)\\ \Sigma \sim square \ exponential \] Because it takes a long time for JAGS to run large datasets, we subset the dataset to a year of data from 2017-4-21 to 2018-4-20.

Apple

apple_emp_cloud = subset %>% emp_semivariogram(apple_naive_residual,Date)
apple_emp = rbind(
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=5)  %>% mutate(binwidth="binwidth=5"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=20) %>% mutate(binwidth="binwidth=20"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=40)   %>% mutate(binwidth="binwidth=40"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=30)  %>% mutate(binwidth="binwidth=30")
)

apple_emp %>%
  ggplot(aes(x=h, y=gamma)) +
  geom_point(size = 1) +
  ggtitle("Empirical Semivariogram of Apple (binned)")+
  facet_wrap(~binwidth, nrow=2)

apple_gp_exp_model = "model{
  y ~ dmnorm(mu, inverse(Sigma))

  for (i in 1:N) {
    mu[i] <- beta[1]+ beta[2] * x[i]
  }
  
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }

  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + sigma2_w
  }

  for (i in 1:2) {
    beta[i] ~ dt(coef[i], 2.5, 1)
  }
  sigma2_w ~ dnorm(10, 1/25) T(0,)
  sigma2   ~ dnorm(50, 1/25) T(0,)
  l        ~ dt(0,2.5,1) T(0,) 
}"
if (file.exists("apple_gp_jags.Rdata")) {
  load(file="apple_gp_jags.Rdata")
} else {
  m = rjags::jags.model(
    textConnection(apple_gp_exp_model), 
    data = list(
      y = subset$apple_close,
      x = subset$Open,
      d = dist(subset$Date) %>% as.matrix(),
      N = nrow(subset),
      coef = coef(apple_lm)
    ),
    quiet = TRUE
  )

  update(m, n.iter=2000)

  exp_cov_coda = rjags::coda.samples(
    m, variable.names=c("beta", "sigma2", "l", "sigma2_w"),
    n.iter=2000, thin=10
  )
  save(exp_cov_coda, file="apple_gp_jags.Rdata")
}
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
  ungroup() %>%
  mutate(.variable = paste0(.variable, "[",i,"]")) %>%
  select(-i)
betas %>%
  group_by(.variable) %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales = "free_y")

params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales="free_y")

params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

params %>%
  slice(seq(1,n(),length.out=500)) %>% 
  filter(.variable == "l") %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    scale_x_log10() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

post = bind_rows(betas, params) %>%
  group_by(.variable) %>%
  summarize(
    post_mean = mean(.value),
    post_med  = median(.value),
    post_lower = quantile(.value, probs = 0.025),
    post_upper = quantile(.value, probs = 0.975)
  )
knitr::kable(post, digits = 5)
.variable post_mean post_med post_lower post_upper
beta[1] 59.63543 55.80959 37.46380 89.79846
beta[2] 0.03976 0.04138 0.02815 0.04903
l 0.09253 0.09374 0.07121 0.11468
sigma2 48.76159 48.84577 39.31196 56.76181
sigma2_w 2.96200 2.92310 2.34422 3.71781
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(apple_gp_resid = apple_close - beta0 - beta1 * Open)

reps=1000
x = subset$Open
y = subset$apple_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(252, 0.01)
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o  = sq_exp_cov(dist_o,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p  = sq_exp_cov(dist_p,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu  = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
apple_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
apple_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
  geom_line(aes(y=apple_close)) +
  geom_ribbon(data=apple_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
  geom_line(data=apple_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) + 
  ylab("Apple Daily Close Stock Price")

#use mean as the predicted value from bayes to calculate RMSE
rmse(subset$fb_close,apple_pred_df_bayes$post_mean)
## [1] 10.59504

Facebook

fb_emp_cloud = subset %>% emp_semivariogram(fb_naive_residual,Date)
fb_emp = rbind(
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=1)  %>% mutate(binwidth="binwidth=1"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=5) %>% mutate(binwidth="binwidth=5"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=15)   %>% mutate(binwidth="binwidth=15"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=30)  %>% mutate(binwidth="binwidth=30")
)

fb_emp %>%
  ggplot(aes(x=h, y=gamma)) +
  geom_point(size = 1) +
  ggtitle("Empirical Semivariogram of Facebook (binned)")+
  facet_wrap(~binwidth, nrow=2)

fb_gp_exp_model = "model{
  y ~ dmnorm(mu, inverse(Sigma))

  for (i in 1:N) {
    mu[i] <- beta[1]+ beta[2] * x[i]
  }
  
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }

  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + sigma2_w
  }

  for (i in 1:2) {
    beta[i] ~ dt(coef[i], 2.5, 1)
  }
  sigma2_w ~ dnorm(10, 1/25) T(0,)
  sigma2   ~ dnorm(390, 1/200) T(0,)
  l        ~ dt(0,2.5,1) T(0,) 
}"
if (file.exists("fb_gp_jags.Rdata")) {
  load(file="fb_gp_jags.Rdata")
} else {
  m = rjags::jags.model(
    textConnection(fb_gp_exp_model), 
    data = list(
      y = subset$fb_close,
      x = subset$Open,
      d = dist(subset$Date) %>% as.matrix(),
      N = nrow(subset),
      coef = coef(fb_lm)
    ),
    quiet = TRUE
  )

  update(m, n.iter=2000)

  exp_cov_coda = rjags::coda.samples(
    m, variable.names=c("beta", "sigma2", "l", "sigma2_w"),
    n.iter=2000, thin=10
  )
  save(exp_cov_coda, file="fb_gp_jags.Rdata")
}
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
  ungroup() %>%
  mutate(.variable = paste0(.variable, "[",i,"]")) %>%
  select(-i)
betas %>%
  group_by(.variable) %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales = "free_y")

params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales="free_y")

params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

params %>%
  slice(seq(1,n(),length.out=500)) %>% 
  filter(.variable == "l") %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    scale_x_log10() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

post = bind_rows(betas, params) %>%
  group_by(.variable) %>%
  summarize(
    post_mean = mean(.value),
    post_med  = median(.value),
    post_lower = quantile(.value, probs = 0.025),
    post_upper = quantile(.value, probs = 0.975)
  )
knitr::kable(post, digits = 5)
.variable post_mean post_med post_lower post_upper
beta[1] 138.00653 137.23548 92.06659 190.34675
beta[2] 0.01138 0.01184 -0.00740 0.03109
l 0.06331 0.06302 0.05771 0.07039
sigma2 385.32748 384.58300 357.71997 417.85007
sigma2_w 4.45172 4.45002 3.73450 5.41662
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(fb_gp_resid = fb_close - beta0 - beta1 * Open)

reps=1000
x = subset$Open
y = subset$fb_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(252, 0.01)
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o  = sq_exp_cov(dist_o,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p  = sq_exp_cov(dist_p,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu  = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
fb_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
fb_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
  geom_line(aes(y=fb_close)) +
  geom_ribbon(data=fb_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
  geom_line(data=fb_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) + 
  ylab("Facebook Daily Close Stock Price")

#use mean as the predicted value from bayes to calculate RMSE
rmse(subset$fb_close,fb_pred_df_bayes$post_mean)
## [1] 2.027406

Google

google_emp_cloud = subset %>% emp_semivariogram(google_naive_residual,Date)
google_emp = rbind(
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=1)  %>% mutate(binwidth="binwidth=1"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=5) %>% mutate(binwidth="binwidth=5"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=15)   %>% mutate(binwidth="binwidth=15"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=30)  %>% mutate(binwidth="binwidth=30")
)

google_emp %>%
  ggplot(aes(x=h, y=gamma)) +
  geom_point(size = 1) +
  ggtitle("Empirical Semivariogram of Google (binned)")+
  facet_wrap(~binwidth, nrow=2)

google_gp_exp_model = "model{
  y ~ dmnorm(mu, inverse(Sigma))

  for (i in 1:N) {
    mu[i] <- beta[1]+ beta[2] * x[i]
  }
  
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }

  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + sigma2_w
  }

  for (i in 1:2) {
    beta[i] ~ dt(coef[i], 2.5, 1)
  }
  sigma2_w ~ dnorm(100, 1/100) T(0,)
  sigma2   ~ dnorm(700, 1/400) T(0,)
  l        ~ dt(0,2.5,1) T(0,) 
}"
if (file.exists("google_gp_jags.Rdata")) {
  load(file="google_gp_jags.Rdata")
} else {
  m = rjags::jags.model(
    textConnection(google_gp_exp_model), 
    data = list(
      y = subset$google_close,
      x = subset$Open,
      d = dist(subset$Date) %>% as.matrix(),
      N = nrow(subset),
      coef = coef(google_lm)
    ),
    quiet = TRUE
  )

  update(m, n.iter=2000)

  exp_cov_coda = rjags::coda.samples(
    m, variable.names=c("beta", "sigma2", "l", "sigma2_w"),
    n.iter=2000, thin=10
  )
  save(exp_cov_coda, file="google_gp_jags.Rdata")
}
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
  ungroup() %>%
  mutate(.variable = paste0(.variable, "[",i,"]")) %>%
  select(-i)
betas %>%
  group_by(.variable) %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales = "free_y")

params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales="free_y")

params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

params %>%
  slice(seq(1,n(),length.out=500)) %>% 
  filter(.variable == "l") %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    scale_x_log10() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

post = bind_rows(betas, params) %>%
  group_by(.variable) %>%
  summarize(
    post_mean = mean(.value),
    post_med  = median(.value),
    post_lower = quantile(.value, probs = 0.025),
    post_upper = quantile(.value, probs = 0.975)
  )
knitr::kable(post, digits = 5)
.variable post_mean post_med post_lower post_upper
beta[1] -498.77787 -500.77321 -505.52283 -468.68757
beta[2] 0.58398 0.58450 0.57281 0.58877
l 0.09541 0.09349 0.07905 0.12334
sigma2 697.70238 697.63588 663.18587 732.68595
sigma2_w 126.16484 126.77453 113.52041 139.03941
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(google_gp_resid = google_close - beta0 - beta1 * Open)

reps=1000
x = subset$Open
y = subset$google_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(252, 0.01)
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o  = sq_exp_cov(dist_o,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p  = sq_exp_cov(dist_p,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu  = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
google_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
google_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
  geom_line(aes(y=google_close)) +
  geom_ribbon(data=google_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
  geom_line(data=google_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) + 
  ylab("Google Daily Close Stock Price")

#use mean as the predicted value from bayes to calculate RMSE
rmse(subset$google_close,google_pred_df_bayes$post_mean)
## [1] 12.0692

Conclusion

Discussion